- (Message net/comp/text:5097)
- Path: math.lsa.umich.edu!zaphod.mps.ohio-state.edu!samsung!uunet!tut.cis.ohio-state.edu!xylophone.cis.ohio-state.edu!jbarnes
- From: jbarnes@xylophone.cis.ohio-state.edu (Julie Ann Barnes)
- Newsgroups: comp.text
- Subject: new tech report
- Message-ID: <82358@tut.cis.ohio-state.edu>
- Date: 23 Jul 90 15:28:23 GMT
- Sender: news@tut.cis.ohio-state.edu
- Reply-To: <jbarnes@cis.ohio-state.edu>
- Organization: Ohio State University Computer and Information Science
- Lines: 60
-
- We have recently published the following technical report:
-
- Analysis of Document Encoding Schemes: A General Model and Retagging
- Toolset
- Julie Barnes
- OSU-CISRC-7/90-TR19, July, 1990, 69 pp.
-
- If you would like a copy, you may send the request via email to
-
- strawser@cis.ohio-state.edu
-
- Please include your postal mailing address.
-
-
- ABSTRACT
-
- Many document encoding schemes exist today, as do many software
- applications for processing electronically encoded documents. The plethora of schemes
- complicates the development of applications that must access documents
- in more than one representation. A uniform representation of
- electronic documents would greatly facilitate software development.
-
- Unfortunately, the retagging of existing electronic documents is
- difficult, given the current development tools. The fundamental
- problem of distinguishing the markup from the text strings is
- complicated by problems such as context-sensitive markup, implicit
- markup, white space, and the matching of start and end tags.
- Lexical-analyzer generators such as Lex are based on formal models
- that are inadequate to handle these problems. Because of this, much
- of the retagging code must be written by hand.
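
One of the problems listed above, the matching of start and end tags, shows why a purely regular-expression-based lexer falls short: nested tags need memory of what is still open. The following is a minimal sketch of that idea only, not code from the report. The SGML-style tag syntax and the name `tags_balanced` are assumptions made for illustration.

```python
import re

# Assumed, simplified tag syntax: <name> opens an element, </name> closes it.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")

def tags_balanced(text):
    """Return True if every start tag has a properly nested end tag."""
    stack = []  # names of the elements that are currently open
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            # An end tag with no open element, or one that closes the wrong element.
            return False
    return not stack  # anything left on the stack was never closed
```

The regular expression only finds the tags; the stack does the matching, and that stack is the hand-written part a Lex-style specification does not provide by itself.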
-
- Based on a generalization of these problems, we develop a new model
- for textual data objects with embedded markup. The model is based on
- the relationships between markup and text strings, and it includes
- four classes of markup strings: symbol, nonsymbol, implicit
- segmenting, and explicit segmenting tags.
-
- We propose a uniform representation called a Lexical Intermediate Form
- with the following lexical properties: 1) the tags are easy to
- distinguish from the text, 2) the tags are unambiguous, and 3) the
- tags are explicit. The LIF borrows its concrete syntax from the ISO
- standard SGML, but it is not encumbered with the SGML concept of
- document-type definitions.
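
To illustrate the third property (explicit tags), here is a small sketch that turns one kind of implicit markup, blank lines separating paragraphs, into explicit SGML-style start and end tags. It is an assumption-laden illustration and does not reproduce the report's actual LIF or toolset: the tag name `p`, the blank-line convention, and the function `make_explicit` are all hypothetical.

```python
import re

def make_explicit(text, tag="p"):
    """Wrap each blank-line-separated block in explicit start and end tags.

    The tag name and the blank-line convention are illustrative assumptions.
    """
    tagged = []
    for block in re.split(r"\n\s*\n", text):
        words = block.split()  # also normalizes white space inside the block
        if words:
            tagged.append("<" + tag + ">" + " ".join(words) + "</" + tag + ">")
    return "\n".join(tagged)
```

For example, `make_explicit("one\ntwo\n\nthree")` yields two tagged blocks, one per paragraph, so a later processing step no longer has to interpret white space as markup.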
-
- Based on the model and the proposed LIF, we identify two steps in the
- retagging process and develop software tools that automatically
- generate the code for each of these steps. We describe our experience
- using the toolset with six encoding schemes of varying complexity:
- the Thesaurus Linguae Graecae, the Dictionary of the Old Spanish
- Language, the Lancaster-Oslo/Bergen Corpus, the Oxford Concordance
- Program, WATCON-2, and Scribe. The toolset reduces coding effort: it
- generated between 4.3 and 23.2 lines of code per line of toolset
- specification. Approximately 98 percent of the retagging code for
- these encoding schemes was generated automatically by the toolset.
- -=-
- Julie A. Barnes
- jbarnes@cis.ohio-state.edu
- Department of Computer and Information Science
- The Ohio State University
- 2036 Neil Ave.
- Columbus, OH USA 43210-1277
-